detection and recognition
Fare: Failure Resilience in Learned Visual Navigation Control
Wang, Zishuo, Loo, Joel, Hsu, David
While imitation learning (IL) enables effective visual navigation, IL policies are prone to unpredictable failures in out-of-distribution (OOD) scenarios. We advance the notion of failure-resilient policies, which not only detect failures but also recover from them automatically. Failure recognition that identifies the factors causing failure is key to informing recovery. We present Fare, a framework to construct failure-resilient IL policies, embedding OOD detection and recognition in them without using explicit failure data, and pairing them with recovery heuristics. Real-world experiments show that Fare enables failure recovery across two different policy architectures, enabling robust long-range navigation in complex environments. Visual navigation is an attractive approach to robot navigation, leveraging rich visual information from low-cost sensors [1]. Imitation learning (IL) has emerged as a key method to learn visual navigation policies [2]-[4], but is inherently limited by training data. IL policies may fail unpredictably on inputs outside the training distribution, often without clear explanation [5]-[7]. This work develops a mechanism to enable IL policies to detect and recover from failures, supporting robust open-world navigation.
Sign Language: Towards Sign Understanding for Robot Autonomy
Agrawal, Ayush, Loo, Joel, Zimmerman, Nicky, Hsu, David
Navigational signs are common aids for human wayfinding and scene understanding, but are underutilized by robots. We argue that they benefit robot navigation and scene understanding by directly encoding privileged information on actions, spatial regions, and relations. Interpreting signs in open-world settings remains a challenge owing to the complexity of scenes and signs, but recent advances in vision-language models (VLMs) make this feasible. To advance progress in this area, we introduce the task of navigational sign understanding, which parses locations and associated directions from signs. We offer a benchmark for this task, proposing appropriate evaluation metrics and curating a test set capturing signs with varying complexity and design across diverse public spaces, from hospitals to shopping malls to transport hubs. We also provide a baseline approach using VLMs, and demonstrate their promise on navigational sign understanding. Code and dataset are available on GitHub.
PatrolVision: Automated License Plate Recognition in the wild
Adoption of AI-driven techniques in public services remains low due to challenges related to accuracy and speed of information at population scale. Computer vision techniques for traffic monitoring have not gained much popularity despite their relative strength in areas such as autonomous driving. Despite the large number of academic methods for Automatic License Plate Recognition (ALPR), very few provide an end-to-end solution for patrolling in the city. This paper presents a novel prototype for a low-power, GPU-based patrolling system to be deployed in an urban environment on surveillance vehicles for automated vehicle detection, recognition and tracking. We propose a complete ALPR system for Singapore license plates, covering both single-line and double-line plates, built on our own YOLO-based network. We focus on unconstrained capture scenarios, as would be the case in real-world applications, where the license plate (LP) might be considerably distorted due to oblique views. We first detect the license plate from the full image using RFB-Net and rectify multiple distorted license plates in a single image. The detected license plate image is then fed to our network for character recognition. We evaluate the performance of our proposed system on a newly built dataset covering more than 16,000 images. The system correctly detects license plates with 86% precision and recognizes the characters of a license plate in 67% of the test set, and achieves 89% accuracy when one incorrect character is tolerated (partial match). We also test the latency of our system and achieve 64 FPS on a Tesla P4 GPU.
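The difference between the exact-match and partial-match figures can be illustrated with a small scoring routine. This is only a sketch under an assumption the abstract does not confirm: a partial match is taken to mean a prediction of the same length as the ground truth that differs in exactly one position. The function names and the plate strings are hypothetical.

```python
def plate_match(predicted: str, truth: str) -> str:
    """Classify a prediction as 'exact', 'partial' (one wrong character), or 'miss'."""
    if predicted == truth:
        return "exact"
    # Assumption: partial match = same length, exactly one differing position.
    if len(predicted) == len(truth):
        mismatches = sum(p != t for p, t in zip(predicted, truth))
        if mismatches == 1:
            return "partial"
    return "miss"


def recognition_rates(pairs):
    """Return (exact rate, exact-or-partial rate) over (predicted, truth) pairs."""
    results = [plate_match(p, t) for p, t in pairs]
    if not results:
        raise ValueError("need at least one (predicted, truth) pair")
    n = len(results)
    exact = results.count("exact") / n
    within_one = (results.count("exact") + results.count("partial")) / n
    return exact, within_one


if __name__ == "__main__":
    # Made-up plate strings for illustration only, not taken from the paper's dataset.
    sample = [
        ("SBA1234A", "SBA1234A"),  # exact
        ("SBA1284A", "SBA1234A"),  # one wrong character
        ("SGX99", "SBA1234A"),     # miss
        ("SJK5678B", "SJK5678B"),  # exact
    ]
    print(recognition_rates(sample))
```

A definition based on edit distance, which would also tolerate a dropped or inserted character, is an equally plausible reading of "partial match".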
DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training
Xie, Yu, Qiao, Qian, Gao, Jun, Wu, Tianxiang, Huang, Shaoyao, Fan, Jiaqing, Cao, Ziqiang, Wang, Zili, Zhang, Yue, Zhang, Jielei, Sun, Huyang
More and more end-to-end text spotting methods based on the Transformer architecture have demonstrated superior performance. These methods utilize a bipartite graph matching algorithm to perform one-to-one optimal matching between predicted objects and actual objects. However, the instability of bipartite graph matching can lead to inconsistent optimization targets, thereby affecting the training performance of the model. Existing literature applies denoising training to solve the problem of bipartite graph matching instability in object detection tasks. Unfortunately, this denoising training method cannot be directly applied to text spotting, which requires irregular shape detection and text recognition, a more complex task than classification. To address this issue, we propose a novel denoising training method (DNTextSpotter) for arbitrary-shaped text spotting. Specifically, we decompose the queries of the denoising part into noised positional queries and noised content queries. We use the four Bezier control points of the Bezier center curve to generate the noised positional queries. For the noised content queries, considering that the output of the text in a fixed positional order is not conducive to aligning position with content, we employ a masked character sliding method to initialize noised content queries, thereby assisting in the alignment of text content and position. To improve the model's perception of the background, we further utilize an additional loss function for background character classification in the denoising training part. Although DNTextSpotter is conceptually simple, it outperforms the state-of-the-art methods on four benchmarks (Total-Text, SCUT-CTW1500, ICDAR15, and Inverse-Text), especially yielding an improvement of 11.3% against the best approach on the Inverse-Text dataset.
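For readers unfamiliar with the representation, a curve defined by four Bezier control points is a cubic Bezier curve, which can be evaluated with the standard Bernstein weights. The sketch below shows only that evaluation; it does not reproduce how DNTextSpotter adds noise or builds queries, and the function names and example points are hypothetical.

```python
def cubic_bezier(control_points, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1].

    control_points: four (x, y) tuples.
    """
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = control_points
    u = 1.0 - t
    # Cubic Bernstein basis weights; they sum to 1 for any t.
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * x0 + b1 * x1 + b2 * x2 + b3 * x3
    y = b0 * y0 + b1 * y1 + b2 * y2 + b3 * y3
    return (x, y)


def sample_curve(control_points, n=5):
    """Return n evenly spaced (in t) points along the curve, endpoints included."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return [cubic_bezier(control_points, i / (n - 1)) for i in range(n)]


if __name__ == "__main__":
    # Hypothetical center curve of a gently arched text line.
    center_curve = [(0, 0), (1, 2), (3, 2), (4, 0)]
    for point in sample_curve(center_curve):
        print(point)
```

The curve starts at the first control point and ends at the last; the two inner points shape the bend, which is what lets four points describe a curved text line compactly.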
Enhanced Review Detection and Recognition: A Platform-Agnostic Approach with Application to Online Commerce
Karmakar, Priyabrata, Hawkins, John
Online commerce relies heavily on user-generated reviews to provide unbiased information about products that buyers have not physically seen. The importance of reviews has attracted multiple exploitative online behaviours, creating a need for methods that monitor and detect reviews. We present a machine learning methodology for review detection and extraction, and demonstrate that it generalises to websites that were not contained in the training data. This method promises to drive applications for automatic detection and evaluation of reviews, regardless of their source. Furthermore, we showcase the versatility of our method by implementing and discussing three key applications for analysing reviews: Sentiment Inconsistency Analysis, which detects and filters out unreliable reviews based on inconsistencies between ratings and comments; Multi-language support, enabling the extraction and translation of reviews from various languages without relying on HTML scraping; and Fake review detection, achieved by integrating a trained NLP model to identify and distinguish between genuine and fake reviews.
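One way to picture Sentiment Inconsistency Analysis is as a comparison between the star rating and a sentiment score for the comment text. The sketch below is an assumption-laden illustration, not the authors' method: it assumes a 1 to 5 star scale, a comment sentiment score in [-1, 1] supplied by some external model, and a hypothetical tolerance of 1.0. All names are invented for the example.

```python
def rating_to_polarity(stars, max_stars=5):
    """Map a rating in 1..max_stars linearly onto [-1, 1]."""
    if not 1 <= stars <= max_stars:
        raise ValueError("rating out of range")
    return 2.0 * (stars - 1) / (max_stars - 1) - 1.0


def is_inconsistent(stars, comment_sentiment, tolerance=1.0):
    """Flag a review whose rating and comment sentiment disagree strongly.

    comment_sentiment: score in [-1, 1], assumed to come from a separate
    sentiment model (not implemented here).
    tolerance: hypothetical threshold on the absolute difference.
    """
    return abs(rating_to_polarity(stars) - comment_sentiment) > tolerance


if __name__ == "__main__":
    # Five stars paired with a strongly negative comment is flagged.
    print(is_inconsistent(5, -0.8))
    # Five stars paired with a positive comment is not.
    print(is_inconsistent(5, 0.7))
```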
Intelligent Robotic Control System Based on Computer Vision Technology
Che, Chang, Zheng, Haotian, Huang, Zengyi, Jiang, Wei, Liu, Bo
Computer vision simulates biological vision using computers and related equipment, and is an important part of the field of artificial intelligence. Its research goal is to give computers the ability to recognize three-dimensional environmental information from two-dimensional images. It draws on image processing, signal processing, probability and statistical analysis, computational geometry, neural networks, machine learning theory and computer information processing to analyze and process visual information. The article explores the intersection of computer vision technology and robotic control, highlighting its importance in various fields such as industrial automation, healthcare, and environmental protection. Computer vision technology, which simulates human visual observation, plays a crucial role in enabling robots to perceive and understand their surroundings, leading to advancements in tasks like autonomous navigation, object recognition, and waste management. By integrating computer vision with robot control, robots gain the ability to interact intelligently with their environment, improving efficiency, quality, and environmental sustainability.
Detecting and recognizing characters in Greek papyri with YOLOv8, DeiT and SimCLR
Turnbull, Robert, Mannix, Evelyn
The capacity to isolate and recognize individual characters from facsimile images of papyrus manuscripts yields rich opportunities for digital analysis. For this reason the 'ICDAR 2023 Competition on Detection and Recognition of Greek Letters on Papyri' was held as part of the 17th International Conference on Document Analysis and Recognition. We used an ensemble of YOLOv8 models to detect and classify individual characters and employed two different approaches for refining the character predictions: a transformer-based DeiT approach and a ResNet-50 model trained on a large corpus of unlabelled data using SimCLR, a self-supervised learning method. Our submission won the recognition challenge with a mean average precision (mAP) of 42.2%, and was runner-up in the detection challenge with a mAP of 51.4%. At the more relaxed intersection over union (IoU) threshold of 0.5, we achieved the highest mean average precision and mean average recall results for both detection and classification. We ran our prediction pipeline on more than 4,500 images from the Oxyrhynchus Papyri to illustrate the utility of our approach, and we release the results publicly in multiple formats.
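The intersection over union threshold mentioned above decides whether a predicted box counts as a correct detection. The following is a minimal sketch of that check for axis-aligned boxes given as (x1, y1, x2, y2); the function names are hypothetical, and the competition's full mAP protocol (confidence ranking, averaging over thresholds and classes) is not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(predicted_box, truth_box, threshold=0.5):
    """A prediction matches the ground truth if IoU meets the threshold."""
    return iou(predicted_box, truth_box) >= threshold


if __name__ == "__main__":
    truth = (0, 0, 10, 10)
    print(iou((2, 0, 12, 10), truth))       # overlap 80, union 120
    print(is_match((2, 0, 12, 10), truth))  # accepted at threshold 0.5
    print(is_match((5, 0, 15, 10), truth))  # overlap 50, union 150: rejected
```

A stricter threshold, such as 0.75, would reject the first prediction as well, which is why results at IoU 0.5 are described as the more relaxed setting.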
Horizontal Federated Computer Vision
Mandal, Paul K., Leo, Cole, Hurley, Connor
In the modern world, the amount of visual data recorded has been rapidly increasing. In many cases, data is stored in geographically distinct locations and thus requires a large amount of time and space to consolidate. Sometimes, there are also regulations for privacy protection which prevent data consolidation. In this work, we present federated implementations for object detection and recognition using a federated Faster R-CNN (FRCNN) and image segmentation using a federated Fully Convolutional Network (FCN). Our FRCNN was trained on 5000 examples of the COCO2017 dataset while our FCN was trained on the entire train set of the CamVid dataset. The proposed federated models address the challenges posed by the increasing volume and decentralized nature of visual data, offering efficient solutions in compliance with privacy regulations.
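In horizontal federated learning, each site trains on its own data and only model parameters are shared and combined. The abstract does not state which aggregation rule was used; the sketch below assumes the common federated averaging (FedAvg) rule, where client parameters are averaged with weights proportional to local dataset size. Parameters are represented as flat lists of floats for simplicity, and all names are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Combine client parameter vectors by dataset-size-weighted averaging.

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes: number of local training examples for each client.
    Assumption: FedAvg-style aggregation; the paper may use a different rule.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of examples must be positive")
    n_params = len(client_weights[0])
    averaged = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != n_params:
            raise ValueError("all clients must share the same model shape")
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged


if __name__ == "__main__":
    # Two hypothetical clients; the second holds three times as much data.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Only the parameter vectors and example counts cross site boundaries in this scheme, which is what allows training without consolidating the raw images.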
Enhancing Vehicle Entrance and Parking Management: Deep Learning Solutions for Efficiency and Security
Ramzan, Muhammad Umer, Ali, Usman, Naqvi, Syed Haider Abbas, Aslam, Zeeshan, Tehseen, Ali, Husnain, Faheem, Muhammad
The automated management of vehicle entrance and parking in any organization is a complex challenge encompassing record-keeping, efficiency, and security concerns. Manual methods for tracking vehicles and finding parking spaces are slow and time-consuming. To address this problem, we use state-of-the-art deep learning models to automate the process of vehicle entrance and parking for any organization. To ensure security, our system integrates vehicle detection, license number plate verification, and face detection and recognition models to confirm that the person and vehicle are registered with the organization. We trained multiple deep learning models for vehicle detection, license number plate detection, and face detection and recognition; the YOLOv8n model outperformed all the other models. Furthermore, license plate recognition is facilitated by Google's Tesseract-OCR Engine. By integrating these technologies, the system offers efficient vehicle detection, precise identification, streamlined record keeping, and optimized parking slot allocation in buildings, thereby enhancing convenience, accuracy, and security. Future research opportunities lie in fine-tuning system performance for a wide range of real-world applications.
UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding
Feng, Hao, Wang, Zijian, Tang, Jingqun, Lu, Jinghui, Zhou, Wengang, Li, Houqiang, Huang, Can
In the era of Large Language Models (LLMs), tremendous strides have been made in the field of multimodal understanding. However, existing advanced algorithms fall short of effectively utilizing the immense representation capabilities and rich world knowledge inherent to these large pre-trained models, and the beneficial connections among tasks within the context of text-rich scenarios have not been sufficiently explored. In this work, we introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities, which existing approaches lack. Moreover, UniDoc capitalizes on the beneficial interactions among tasks to enhance the performance of each individual task. To implement UniDoc, we perform unified multimodal instruction tuning on the contributed large-scale instruction-following datasets. Quantitative and qualitative experimental results show that UniDoc sets state-of-the-art scores across multiple challenging benchmarks. To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.